ABSTRACT

We present a simple mathematical criterion for determining whether a given statistical model does not describe several independent 
sets of measurements, or data modes, adequately. We derive this criterion for two data sets and generalise it to several sets by using 
the Bayesian updating of the posterior probability density. To demonstrate the usage of the criterion, we apply it to observations of 
exoplanet host stars by re-analysing the radial velocities of HD 217107, Gliese 581, and υ Andromedae and show that the currently 
used models are not necessarily adequate in describing the properties of these measurements. We show that while the two data sets 
of Gliese 581 can be modelled reasonably well, the noise model of HD 217107 needs to be revised. We also reveal some biases in 
the radial velocities of υ Andromedae and report updated orbital parameters for the recently proposed 4-planet model. Because of the 
generality of our criterion, no assumptions are needed on the nature of the measurements, models, or model parameters. The method 
we propose can be applied to any astronomical problems, as well as outside the field of astronomy, because it is a simple consequence 
of the Bayes' rule of conditional probabilities. 
1. Introduction 

Since the discovery of the first clear-cut example of an extrasolar 
planet orbiting a normal star (Mayor & Queloz 1995), Doppler 
spectroscopy, or radial velocity (RV), has been the most efficient 
method in detecting extrasolar planets orbiting nearby stars.¹ 

Because the same nearby stars can be targets of several RV 
surveys, it is possible to combine the information in two or 
more RV data sets by means of Bayesian inference 
(e.g. Gregory 2011; Tuomi 2011) and posterior updating. 
However, little is known about the possible biases individual 
data sets, or RV timeseries, may contain with respect to 
one another. Therefore, we use Bayesian tools to determine 
whether the common statistical models can be used to analyse 
RV timeseries without bias, and if not, how these models can 
be improved to yield trustworthy results. For these purposes 
we introduce a method for determining model inadequacy in 
describing multiple sets of measurements: the Bayesian model 
inadequacy criterion. 

The Bayes' rule leads naturally to the commonly used 
Bayesian model comparison methods (e.g. Jeffreys 1961). 
These methods can be used efficiently to compare the relative 
performance of different statistical models of some a priori 
selected model set. The Bayes' rule can be used to calculate the 
relative posterior probabilities of the models in the set given some 
measurements that describe some aspect of the modelled system. 
However, because only the relative performances of the models 
can be compared, it cannot be said whether the model with the 
greatest posterior probability is adequately accurate in 
describing the measured quantities. 



* The corresponding author, e-mail: m.tuomi@herts.ac.uk; 
mikko.tuomi@utu.fi 

1 See The Extrasolar Planets Encyclopaedia for an up-to-date list of 
known planetary candidates: http://exoplanet.eu/. 



The Bayes factors (Kass & Raftery 1995; Chib & Jeliazkov 2001; 
Ford & Gregory 2007) and other related measures of model 
goodness, such as the various information criteria 
(e.g. Akaike 1973; Schwarz 1978; Spiegelhalter et al. 2002) 
derived using different approximations, can only be used to tell 
which one of the models in some model set describes the 
measurements the best, i.e. the relative "goodness" of the models 
can be determined reliably. However, they cannot be used to 
assess whether this best model is as accurate a description as 
possible given the information in the measurements. Our method 
of determining model inadequacy in this sense can be used to 
assess whether the model set can be estimated to contain a 
sufficiently accurate model that can be used to describe the 
measurements reliably. 

Whether or not a given statistical model can be used to describe 
several sets of data in an adequate manner has not 
been studied very extensively in the statistics literature. In 
Kaasalainen (2011), the author presents a method for 
determining the optimal combination of two or more sources of 
data, or data modes. However, we are not aware of a single 
study discussing this problem in the Bayesian context, though 
Spiegelhalter et al. (2002) appear to discuss "model 
adequacy" in their article introducing the deviance information 
criterion, but they use the term interchangeably with the term "fit". 
Yet, determining whether a single model can describe two or 
more data sets without bias is of increasing importance in 
astronomy, particularly for indirect detections of the most 
interesting exoplanets whose signals lie close to the current limits of 
instrument sensitivity. 

Since the RV variations of typical targets of Doppler 
spectroscopy surveys are commonly modelled using a 
superposition of Keplerian signals, reference velocities, and possible 
linear trends, corrupted by some Gaussian noise, we use these 
models as a starting point of our analyses. However, we emphasise 
that the RV variations caused by the stellar surface, usually 
referred to as the stellar jitter, are in general, despite some 
efforts in modelling their magnitude (Wright 2005), thought to 
arise from dark or bright spots primarily driven by stellar 
rotation (Barnes et al. 2011; Boisse et al. 2011), and their effect on 
the RV's is not understood very well at the moment. Therefore, 
we model the excess noise in the RV's with care and show 
explicitly the statistical models we use in the analyses. 

In section 2 we describe what we mean by the model inade- 
quacy in describing two or more independent measurements or 
sets of measurements and provide a simple way of determining 
it in practice. We describe the details of our model inadequacy 
criterion in the Appendix. Finally, in section 3 we apply this cri- 
terion in practice by analysing astronomical RV exoplanet de- 
tections made using at least two different telescope-instrument 
combinations. 



2. Bayesian analyses and model inadequacy 

The Bayesian methods do not differentiate between determining 
the most probable parameter values or most probable models 
containing these parameters. They can all be arranged into 
a linear order, which yields information on the observed system 
provided that the selected models describe the observed system 
realistically enough. It is possible to calculate the relative posterior 
probabilities of any number of models and determine their 
relative magnitudes in a similar way as it is possible to determine the 
posterior odds of having the measurements drawn from a 
probability density characterised by a certain parameter value of any 
one of the models. We do not describe the process of determining 
the posterior probability densities of the model parameters 
here, because several well-known posterior sampling methods 
exist and they have been well covered by the existing literature 
(e.g. Metropolis et al. 1953; Hastings 1970; Geman & Geman 
1984; Haario et al. 2001). The performance of these methods 
has also been demonstrated by several re-analyses of existing 
RV data, revealing the existence of planets (e.g. Gregory 2005, 
2007a,b; Tuomi & Kotiranta 2009) or disputing it (e.g. Tuomi 
2011). In these works, the model probabilities have played an 
important role in assessing the number of planetary companions 
orbiting nearby stars. 

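Commonly, the Bayesian tools are used to assess the 
probabilities of different statistical models given the measurements 
$m$ that are being analysed using the models. These tools provide 
the relative probabilities of the selected models $\mathcal{M}_i$, $i = 1, ..., k$, 
in the a priori determined model set as 

\[ P(\mathcal{M}_i \mid m) = \frac{P(m \mid \mathcal{M}_i) P(\mathcal{M}_i)}{\sum_{j=1}^{k} P(m \mid \mathcal{M}_j) P(\mathcal{M}_j)}, \tag{1} \]

where probabilities $P(\mathcal{M}_i)$, $i = 1, ..., k$, are the prior probabilities 
of the different models and the marginal likelihoods $P(m \mid \mathcal{M}_i)$ are 
defined as 

\[ P(m \mid \mathcal{M}_i) = \int_{\theta_i \in \Theta_i} l(m \mid \theta_i) \pi(\theta_i) \, d\theta_i, \tag{2} \]

where $\pi(\theta_i)$ is the prior probability density of the parameter or 
parameter vector $\theta_i$ of the model $\mathcal{M}_i$, $\Theta_i$ is the corresponding 
parameter space, and $l(m \mid \theta_i)$ represents the 
likelihood function corresponding to the model. 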



The interpretation of the posterior probabilities in Eq. (1) is 
a rather subjective matter because they are relative and it is only 
possible to assess how much confidence one has in one of the 
models compared to the rest of them. According to the views of 
Jeffreys (1961) and Kass & Raftery (1995), a model would have to 
be at least 150 times more probable than the next best model to 
have strong evidence in favour of it. We adopt the same 
threshold because claiming that there are k + 1 planets orbiting a star 
instead of k needs to be on solid ground with respect to the 
model probabilities. In particular, if the model with k + 1 planets 
were, e.g., 50 times more probable than that with k planets, there 
would still be a roughly 2% probability that the k-planet model 
explains the data. Therefore, we choose a rather high threshold 
when interpreting the posterior probabilities of models with 
different numbers of Keplerian signals. 
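
The normalisation in Eq. (1) and the "roughly 2%" figure quoted above can be checked with a few lines of Python. This is only a minimal sketch: the function name is hypothetical, equal prior model probabilities are assumed when none are given, and the calculation is done with logarithms of the marginal likelihoods because the raw values easily under- or overflow.

```python
import math

def posterior_model_probabilities(log_marginals, priors=None):
    """Eq. (1): normalise P(m|M_i) P(M_i) over the model set, working in logs."""
    k = len(log_marginals)
    if priors is None:
        priors = [1.0 / k] * k  # assumption: equal prior model probabilities
    log_terms = [lm + math.log(p) for lm, p in zip(log_marginals, priors)]
    shift = max(log_terms)  # subtract the maximum to keep the exponentials finite
    weights = [math.exp(t - shift) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]

# Two models whose marginal likelihoods differ by a factor of 50:
# the less probable one retains a probability of 1/51, i.e. roughly 2%.
probs = posterior_model_probabilities([0.0, math.log(50.0)])
print(f"{probs[0]:.4f}")
```

With a factor of 150 instead of 50, the same calculation leaves a probability of 1/151 for the less probable model.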

With the marginal likelihoods available according to Eq. (2), 
we define the model $\mathcal{M}$ to be an inadequate description of 
independent measurements $m_i$, $i = 1, ..., N$, if it holds that for 
some small positive number $r$ 

\[ B(m_1, ..., m_N \mid \mathcal{M}) := \frac{P(m_1, ..., m_N \mid \mathcal{M})}{\prod_{i=1}^{N} P(m_i \mid \mathcal{M})} < r. \tag{3} \]



This definition is based on the independence of the 
measurements and on their being modelled with a single statistical 
model. It is a simple result of a relation between the marginal 
likelihoods of each of the measurements and the joint marginal 
likelihood of all of them shown in Eq. (A.8). We derive this criterion 
using the Bayes' rule of conditional probabilities and the 
concept of independence, and also interpret the results in terms of 
information theory in the Appendix. 

The number $r$ has an interpretation as a threshold value. For 
instance, the model being inadequate with probabilities 90%, 
95%, and 99% corresponds to threshold values of 0.111, 0.053, 
and 0.010, respectively (see Appendix). Therefore, if the best 
model according to Eq. (1) satisfies Eq. (3) for some reasonably 
small $r$, it can be concluded that the model does not describe the 
measurements without bias and the corresponding analysis 
results may be biased as well. In such a case, the model set has to 
be re-considered and expanded by adding better descriptions of 
the data to it. In practice, we use the 95% threshold value, but 
choosing its value is a subjective issue and only represents how 
confidently one wants to determine the model inadequacy. 
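
The threshold values quoted above are reproduced by the relation r = (1 - p) / p between the threshold r and the probability p of inadequacy, or equivalently p = 1 / (1 + B). The derivation belongs to the Appendix and is not repeated here, so the sketch below treats this relation as an assumption that happens to match the quoted numbers; the function names are illustrative.

```python
def log_inadequacy_bayes_factor(log_joint, log_individual):
    """log B of Eq. (3): joint log marginal likelihood minus the sum of the individual ones."""
    return log_joint - sum(log_individual)

def inadequacy_probability(bayes_factor):
    """Assumed mapping from B to the probability of inadequacy, p = 1 / (1 + B)."""
    return 1.0 / (1.0 + bayes_factor)

def threshold(probability):
    """Inverse mapping: the value r below which B signals inadequacy with probability p."""
    return (1.0 - probability) / probability

for p in (0.90, 0.95, 0.99):
    print(f"p = {p:.2f} -> r = {threshold(p):.3f}")   # 0.111, 0.053, 0.010

# The value B = 0.05 reported for HD 217107 in Sect. 3.1 maps to p of about 0.95.
print(f"{inadequacy_probability(0.05):.2f}")
```

The same mapping turns the values B = 3.3 x 10^12 and B = 2.0 x 10^10 reported in Sect. 3 into inadequacy probabilities of about 3.0 x 10^-13 and 5.0 x 10^-11.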

We note that the model inadequacy can also be interpreted in 
terms of the measurements being inconsistent with one another 
with respect to the model used. This interpretation arises from 
the fact that the model may not take into account some features 
in one or more data sets that result from biases in the process 
of making the measurements or from some other unmodelled 
features in the data. We use the inadequacy of the model given 
the data sets and the inconsistency of the data sets with respect 
to this model interchangeably throughout this article. 

We describe the parameter probability densities using 
three numbers. These numbers are the maximum a posteriori 
(MAP) estimate of the posterior density and the limits 
of the 99% Bayesian credibility set $\mathcal{D}_{0.99}$ as defined in 
e.g. Tuomi & Kotiranta (2009). We calculate these estimates 
from the posterior densities of the model parameters obtained 
using the adaptive Metropolis posterior sampling algorithm 
(Haario et al. 2001), which is a modification of the famous 
Metropolis-Hastings (M-H) algorithm (Metropolis et al. 1953; 
Hastings 1970) that adapts the proposal density to the shape of 
the posterior density of the model parameters. Because of this 
property, it is not very sensitive to the choice of initial 
parameter vector or proposal density, desired features that make the 
method significantly more robust than the common M-H 
algorithm by enabling a more rapid convergence to the posterior. 
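
As an illustration of these three numbers, the sketch below extracts a MAP estimate and a 99% interval from a one-dimensional posterior sample. It rests on assumptions that are not taken from the text: the MAP estimate is approximated by the sampled point with the highest recorded log-posterior value, and the credibility set is approximated by the shortest interval holding 99% of the samples, which coincides with a highest-density set only for a unimodal marginal. The formal definition of $\mathcal{D}_{0.99}$ is the one in Tuomi & Kotiranta (2009).

```python
import math

def map_estimate(samples, log_posteriors):
    """Return the sampled value with the highest recorded log-posterior."""
    best = max(range(len(samples)), key=lambda i: log_posteriors[i])
    return samples[best]

def shortest_interval(values, level=0.99):
    """Shortest interval containing a fraction `level` of the sampled values."""
    ordered = sorted(values)
    n = len(ordered)
    m = max(1, math.ceil(level * n))  # number of samples the interval must hold
    # Slide a window of m consecutive sorted samples and keep the narrowest one.
    start = min(range(n - m + 1), key=lambda i: ordered[i + m - 1] - ordered[i])
    return ordered[start], ordered[start + m - 1]

# Illustrative sample: 100 evenly spaced values plus one distant outlier.
sample = [float(v) for v in range(100)] + [1000.0]
low, high = shortest_interval(sample, level=0.99)
print(low, high)  # the outlier falls outside the 99% interval
```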

While the adaptive Metropolis algorithm assumes a Gaussian 
proposal density, it adapts to the posterior reasonably rapidly and 
a sample of roughly $10^6$ is sufficient for the chain burn-in 
period in all the analyses, i.e. until the chain converges to the 
posterior. Because of the Gaussian proposal density, the acceptance rate of the 
chain can sometimes decrease to values as low as 1% when the 
posterior density is highly nonlinear, as is commonly the case with 
RV data. However, in such cases, we simply increased the chain 
length by a factor of 10-20 and saved computer memory by only 
saving every 10th or 20th member of the chain to the output file. 
We verified that the chain had indeed converged by running up to 
five samplings with different initial values and required that they 
all produced marginal integrals that were equal up to the second 
digit. With a converged chain, we then calculated the marginal 
likelihoods using the method of Chib & Jeliazkov (2001). 
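
The adaptation and thinning described above can be sketched in one dimension as follows. This is a simplified illustration in the spirit of Haario et al. (2001), not the code used for the analyses: the proposal variance is set to 2.4^2 times the running sample variance of the chain (plus a small regularising term), and the parameter names, the initial proposal width, and the length of the non-adaptive start are all assumptions.

```python
import math
import random

def adaptive_metropolis(log_post, x0, n_steps, adapt_start=500, thin=1, eps=1e-6, seed=0):
    """One-dimensional adaptive Metropolis sampler with optional thinning."""
    rng = random.Random(seed)
    scale = 2.4 ** 2                 # s_d = 2.4^2 / d for dimension d = 1
    x, lp = x0, log_post(x0)
    mean, m2, count = x0, 0.0, 1     # running mean and sum of squared deviations
    step_sd = 1.0                    # initial proposal width (assumption)
    chain, accepted = [], 0
    for step in range(1, n_steps + 1):
        if step > adapt_start:
            variance = m2 / (count - 1)
            step_sd = math.sqrt(scale * (variance + eps))
        proposal = x + rng.gauss(0.0, step_sd)
        lp_new = log_post(proposal)
        # Metropolis acceptance test; 1 - random() lies in (0, 1], so log is defined.
        if math.log(1.0 - rng.random()) < lp_new - lp:
            x, lp = proposal, lp_new
            accepted += 1
        # Welford update of the chain mean and variance with the current state.
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        if step % thin == 0:         # keep only every `thin`-th member of the chain
            chain.append(x)
    return chain, accepted / n_steps

# Usage on a standard normal target, saving every 10th member of the chain.
chain, rate = adaptive_metropolis(lambda x: -0.5 * x * x, x0=3.0, n_steps=20000, thin=10)
```

Running several such chains from different initial values and comparing the resulting estimates mirrors the convergence check described in the text.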

For the sake of trustworthiness, throughout this article we 
also take into account the uncertainties in the stellar masses 
when calculating the semi-major axes and RV masses of the 
planets orbiting them. These uncertainties are taken into account 
by using a direct Monte Carlo simulation - i.e. by drawing 
random values from both the density of the model parameters and 
the estimated density of the stellar mass when calculating the 
densities of the semi-major axes and planetary RV masses. We 
assume that the estimated distribution of the stellar mass, usually 
reported using mean and standard error, is independent of 
the densities of the orbital parameters from the posterior 
samplings. 
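
A minimal sketch of this direct Monte Carlo propagation is given below for the semi-major axis. It assumes a Gaussian stellar mass distribution, neglects the planetary mass in Kepler's third law, and uses purely illustrative numbers for the period sample and the stellar mass; none of these values are taken from the analyses.

```python
import random

def semi_major_axis_au(period_days, stellar_mass_msun):
    """Kepler's third law, planetary mass neglected: a^3 = M_star * P^2 (AU, yr, M_sun)."""
    period_years = period_days / 365.25
    return (stellar_mass_msun * period_years ** 2) ** (1.0 / 3.0)

def propagate(period_samples, mass_mean, mass_sd, seed=0):
    """Pair each posterior period draw with an independent stellar mass draw."""
    rng = random.Random(seed)
    draws = []
    for period in period_samples:
        mass = rng.gauss(mass_mean, mass_sd)  # independent of the orbital parameters
        if mass > 0.0:                        # discard unphysical mass draws
            draws.append(semi_major_axis_au(period, mass))
    return draws

# Hypothetical inputs: a narrow period posterior and a stellar mass of 1.1 +/- 0.1 M_sun.
periods = [7.1268] * 1000
axes = propagate(periods, mass_mean=1.1, mass_sd=0.1)
```

The resulting list can then be summarised in the same way as any other posterior sample, for example by a MAP-like point estimate and a 99% interval.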



3. Model inadequacy criterion and exoplanet 
detections 

In principle, analysing RV data is reasonably simple because 
the planet-induced stellar wobble can be modelled using the 
well-known Newtonian laws of motion - especially if the 
gravitational planet-planet interactions are not significant on the 
timescale of the observations and post-Newtonian effects are 
negligible. In practice, though, there are several aspects of the 
RV measurements that are not understood well enough to be able 
to consider that the models describe the measurements in an 
adequate manner. These aspects include e.g. disturbances caused 
by undetected planets or planets whose orbital periods cannot 
be constrained (e.g. Ford & Gregory 2007); noise caused by the 
inhomogeneities in the stellar surface, usually referred to as the 
stellar "jitter" (e.g. Wright 2005); and excess noise and 
possible biases that are particular to the various instruments and 
telescopes used to make the observations. All these aspects make 
the analyses of RV's challenging and, if not accounted for 
properly by the statistical models used, can lead to biased results and 
misleading interpretations. 

In this section we re-analyse three RV data sets made using 
at least two telescope-instrument combinations. Assuming these 
sets are independent - which is a common assumption, though 
not explicitly stated most of the time when analysing several sets 
of measurements - we apply the model inadequacy criterion to 
find out if the common models should be modified and if the 
corresponding results are different from the ones found in the 
literature. 



Table 1. The relative model probabilities of k planet models for 
the combined data set of HD 217107. 

3.1. HD 217107 



The RV's of HD 217107 are known to contain the signatures 
of two extrasolar planets (Fischer et al. 1999, 2001; 
Vogt et al. 2000, 2005; Naef et al. 2001; Wittenmyer et al. 
2007; Wright et al. 2009). The system consists of a massive 
short-period planet with an orbital period of roughly 7 days, 
and an outer long-period planet with an orbital period of 11 
years. The RV's of this target have been observed using 4 
instruments mounted on 5 telescopes, namely, Euler (Naef et al. 
2001), Harlan J. Smith (HJS) (Wittenmyer et al. 2007), Keck I 
(Wright et al. 2009), and Shane and Coudé Auxiliary Telescope 
(CAT) at the Lick Observatory (Wright et al. 2009). Together, 
there are 293 RV measurements of this system. 

The most up-to-date solution is that of Wright et al. (2009), 
where the combined Keck and Lick data with 207 measurements 
were analysed. However, the authors do not discuss the exact 
statistical model used in their analyses and therefore we feel that 
this combined data set should be re-analysed to see how the four 
data sets should be modelled to obtain the most trustworthy 
results. 

Following the common Bayesian approach (e.g. Gregory 
2005, 2007a,b; Tuomi & Kotiranta 2009; Tuomi 2011), we 
choose our model set to consist of four models, namely, models 
$\mathcal{M}_k$, $k = 0, ..., 3$, where $k$ denotes the number of planetary signals 
in the data. Therefore, there are $5k + 5$ parameters in our models, 
corresponding to 5 parameters for each planet (RV amplitude 
$K$, orbital eccentricity $e$, orbital period $P$, longitude of pericentre 
$\omega$, and mean anomaly $M_0$, i.e. the date of periastron passage as 
expressed in radians between 0 and $2\pi$), four parameters 
describing the reference velocities of each data set $\gamma_l$, $l = 1, ..., 4$, and 
the parameter describing the magnitude of stellar jitter $\sigma_J$. Our 
set of statistical models describing the measurement $m_{i,l}$ made at 
time $t_i$ is 

\[ m_{i,l} = r_k(t_i) + \gamma_l + \epsilon_i + \epsilon_J, \quad k = 0, ..., 3, \tag{4} \]

where $r_k$ represents the $k$ Keplerian signals and $\epsilon_i$ and $\epsilon_J$ are 
Gaussian random variables with zero mean and known 
variance $\sigma_i^2$ and an unknown variance $\sigma_J^2$, respectively. The variance 
$\sigma_i^2$ corresponds to the instrument uncertainty of each individual 
measurement, which is usually assumed known and is reported 
together with the data. 
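
The model of Eq. (4) can be sketched in Python as a superposition of Keplerian signals plus a reference velocity, with a Gaussian log-likelihood whose variance is the sum of the reported instrument variance and the jitter variance. The function names, the argument order, and the convention that $M_0$ is the mean anomaly at t = 0 are assumptions made for this illustration and may differ from those used in the analyses.

```python
import math

def solve_kepler(mean_anomaly, e, tol=1e-12, max_iter=100):
    """Solve E - e sin E = M for the eccentric anomaly E by Newton iteration."""
    E = mean_anomaly if e < 0.8 else math.pi
    for _ in range(max_iter):
        delta = (E - e * math.sin(E) - mean_anomaly) / (1.0 - e * math.cos(E))
        E -= delta
        if abs(delta) < tol:
            break
    return E

def keplerian_rv(t, K, e, P, omega, M0):
    """Radial velocity of one Keplerian signal at time t (same time unit as P)."""
    M = (M0 + 2.0 * math.pi * t / P) % (2.0 * math.pi)
    E = solve_kepler(M, e)
    nu = 2.0 * math.atan2(math.sqrt(1.0 + e) * math.sin(E / 2.0),
                          math.sqrt(1.0 - e) * math.cos(E / 2.0))  # true anomaly
    return K * (math.cos(nu + omega) + e * math.cos(omega))

def log_likelihood(times, rvs, sigmas, dataset, planets, gammas, sigma_j):
    """Gaussian log-likelihood of Eq. (4) with variance sigma_i^2 + sigma_J^2."""
    total = 0.0
    for t, v, s, l in zip(times, rvs, sigmas, dataset):
        model = gammas[l] + sum(keplerian_rv(t, *p) for p in planets)
        var = s ** 2 + sigma_j ** 2
        total += -0.5 * (math.log(2.0 * math.pi * var) + (v - model) ** 2 / var)
    return total

# One circular signal (K = 10, P = 5) observed once by data set 0 with offset 1.0.
ll = log_likelihood([0.0], [11.0], [3.0], [0], [(10.0, 0.0, 5.0, 0.0, 0.0)], [1.0], 4.0)
```

Replacing the single `sigma_j` by one noise parameter per data set gives a likelihood of the kind needed when the noise level is allowed to differ between instruments.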

We analyse the combined data set using the models $\mathcal{M}_k$, $k = 
0, ..., 3$, and receive the model probabilities in Table 1. These 
probabilities imply with high confidence that there are two 
companions orbiting the star. However, the Bayes factor 
determining the inadequacy of the best model in the model set has to be 
calculated to assess the reliability of this model. Denoting the 
four data sets as $m_l$, $l = 1, ..., 4$, we receive $B(m_1, ..., m_4) = 0.05$, 
which means that the model is an inadequate description of the 
data with a probability of 0.95. This implies that the model set 
does not contain a sufficiently good model, i.e. the data sets are 
not consistent with one another given this model, and that the 
model set needs to be expanded. 

Table 2. The relative model probabilities of k planet models $\mathcal{M}_k$ 
and $\mathcal{M}_{I,k}$ for the combined data set of HD 217107 (all 8 
probabilities are on the same scale). 

k    P(M_k | m)            P(M_{I,k} | m)
0    < 10^(illegible)      < 10^(illegible)
1    < 10^(illegible)      < 10^(illegible)
2    < 10^-14              1.00
3    < 10^-17              < 10^-2



Because of the inadequacy of model $\mathcal{M}_2$, we no longer 
assume that the instrument noise is known according to the 
variances $\sigma_i^2$ but suspect that there could be unknown random 
variations or biases that differ between the data sets. Therefore, we 
expand our model set by models 

\[ m_{i,l} = r_k(t_i) + \gamma_l + \epsilon_i + \epsilon_{I,l}, \quad k = 0, ..., 3, \tag{5} \]

where the Gaussian random variable $\epsilon_{I,l}$ is different for every 
data set and is assumed to consist of additional random 
variation caused by the instrument noise and stellar jitter. Therefore, 
in this model, the resulting values $\sigma_{I,l}$ can only be interpreted 
as giving the upper limit for the stellar jitter. We denote these 
models as $\mathcal{M}_{I,k}$. 

Using the expanded model set, we receive the model 
probabilities in Table 2. These probabilities imply that there are 
indeed differences in the noise levels of the different data sets and 
that these differences have to be taken into account when 
assessing the orbital parameters of the planets. We calculate the model 
inadequacy Bayes factor $B(m_1, ..., m_4)$ for the best model $\mathcal{M}_{I,2}$. 
This time $B(m_1, ..., m_4) = 3.3 \times 10^{12}$, which corresponds to an 
inadequacy probability of $3.0 \times 10^{-13}$, a value that clearly states 
the best model cannot be considered inadequate. 

We have listed the solution of the model with the greatest 
posterior probability, $\mathcal{M}_{I,2}$, in Table 3. While consistent with the 
results of Wright et al. (2009), our solution with the best model 
$\mathcal{M}_{I,2}$ has much more uncertain parameter values, especially for 
the period, RV mass, and RV amplitude of the outer companion, 
which is also found to be heavily correlated with the reference 
velocity parameters. We show the 99%, 95%, and 50% equiprobability 
contours of RV mass and period of the outer companion in Fig. 
1 (the gap in the 50% contours arises from the numerical 
inaccuracy of the plot). This figure is similar to Fig. 8 in Wright et al. 
(2009), but they used the $\chi^2$ density for the plot instead of the 
posterior density. Also, we note that the jitter of HD 217107 has a 
level of at most 6.0 m s$^{-1}$ based on the noise in the Euler data, 
which turned out to contain the least noise out of the four data 
sets. It is also interesting to see that the Lick data therefore had at 
least 5 m s$^{-1}$, but possibly even more than 10.0 m s$^{-1}$, of additional 
uncertainty that can only be caused by the telescopes and the 
instrument. Therefore, it cannot be said that the Lick instrument 
uncertainty is known according to the standard uncertainties of 
the data reduction pipeline, as reported when publishing Lick 
RV's. This could in fact be one of the reasons the parameter 
values in our solution (Table 3) appear to be more uncertain than 
those reported by Wright et al. (2009), though they do not 
indicate the confidence level of the reported uncertainties. 

3.2. Gliese 581 

The Gliese 581 planetary system has been claimed to be a host to 
as many as six relatively low-mass planets (Bonfils et al. 2005; 
Udry et al. 2007; Mayor et al. 2009; Vogt et al. 2010). Though 
the most likely number of planetary companions in the system is 
four (Tuomi 2011) or five (Gregory 2011), the RV's of Gliese 
581 provide a challenging analysis problem because the signals 
are only barely distinguishable from the relatively noisy 
measurements. 

Fig. 1. The equiprobability contours of the RV mass and orbital 
period of HD 217107 c containing 50%, 95%, and 99% of the 
probability density. 

Table 4. The relative model probabilities of k planet models $\mathcal{M}_k$ 
and $\mathcal{M}_{I,k}$ for the combined data set of Gliese 581. 


We start by analysing the combined data set of HARPS and 
HIRES RV measurements (see e.g. Vogt et al. 2010; Gregory 
2011; Tuomi 2011) using the models $\mathcal{M}_k$ and $\mathcal{M}_{I,k}$ with $k = 
0, ..., 5$. We choose this model set because we already suspect, 
based on the analysis of the RV's of HD 217107, that this 
combined data set may have different noise levels corresponding to 
the different telescope-instrument combinations. 

The posterior probabilities of the models in our model set are 
shown in Table 4. These probabilities, while having the greatest 
value for model $\mathcal{M}_{I,5}$, do not support the conclusion that there 
are five Keplerian signals in the data strongly enough because the 
probability of model $\mathcal{M}_{I,4}$ is highly significant. Therefore, we 
check the inadequacy of the latter model to see if our statistical 
model is good enough. 

The Bayes factor in Eq. (3) has a value of $2.0 \times 10^{10}$ for the 
four-planet model $\mathcal{M}_{I,4}$, which means that the probability of the 
HIRES and HARPS data sets being inadequately described by 
the model is $5.0 \times 10^{-11}$, a value low enough to conclude that 
there is no need to revise the model. We note that this model, 
an order of magnitude more probable than the previously used 
model $\mathcal{M}_4$ (Tuomi 2011), does not result in a revision of the 
orbital parameters (Table 5). However, the noise parameters of 
the two data sets do differ from one another slightly. Denoting 
the HIRES data set with $l = 1$ and the HARPS data set with $l = 
2$, the parameters $\sigma_{I,l}$, $l = 1, 2$, have MAP estimates of 2.39 and 
1.50 m s$^{-1}$, respectively. The corresponding 99% credibility sets 
are [1.77, 3.09] and [1.00, 2.01] m s$^{-1}$, respectively. Therefore, 
the noise in the HARPS measurements gives an upper limit for 
the jitter of Gliese 581 of 2.01 m s$^{-1}$, whereas there is likely a 
small amount of additional instrument noise in the HIRES data. 

Table 3. The two-planet solution of the HD 217107 combined data set. The MAP estimates of the parameters and in brackets the limits 
of their $\mathcal{D}_{0.99}$ sets. The solution of Wright et al. (2009) is shown for comparison for the corresponding parameters as reported by 
them. 

                            M_{I,2}                                          Wright et al. (2009)
Parameter                   Planet b                     Planet c            Planet b        Planet c
P [days]                    7.12664 [7.12674, 7.12692]   4300 [3800, 6000]   7.126816(39)    4270(220)
e                           0.123 [0.111, 0.139]         0.49 [0.39, 0.58]   0.1267(52)      0.517(33)
K [m s^-1]                  138.3 [136.0, 140.1]         31.5 [25.0, 60.4]   139.20(92)      35.7(1.3)
ω [rad]                     0.39 [0.29, 0.52]            3.38 [3.12, 3.82]
M_0 [rad]                   4.97 [4.85, 5.08]            1.44 [0.63, 1.80]
m_p sin i [M_Jup]           1.35 [1.22, 1.47]            2.6 [1.8, 5.4]      1.39(11)        2.60(15)
a [AU]                      0.0742 [0.0701, 0.0771]      5.3 [4.7, 6.6]      0.0748(43)      5.32(38)
γ_1 [m s^-1] (Euler)        6.6 [-12.7, 14.2]
γ_2 [m s^-1] (HJS)          11.0 [-6.9, 19.0]
γ_3 [m s^-1] (Keck)         -0.8 [-19.9, 4.9]
γ_4 [m s^-1] (Lick)         -1.2 [-19.9, 5.0]
σ_{I,1} [m s^-1] (Euler)    2.7 [0.0, 6.0]
σ_{I,2} [m s^-1] (HJS)      4.8 [1.1, 8.4]
σ_{I,3} [m s^-1] (Keck)     5.4 [4.4, 6.4]
σ_{I,4} [m s^-1] (Lick)     12.9 [10.9, 15.4]




3.3. υ Andromedae 

The RV's of υ And have shown three strong Keplerian 
signals resulting from three massive planets orbiting the 
star (Butler et al. 1997, 1999; Fischer et al. 2003; Naef et al. 
2004; Wittenmyer et al. 2007; Wright et al. 2009). The 
star has been a target of five RV surveys for several 
years, namely, Lick (Butler et al. 1999; Fischer et al. 2003; 
Wright et al. 2009), the Advanced Fiber-Optic Echelle 
spectrometer (AFOE) at the Whipple Observatory (Butler et al. 
1999), HJS (Wittenmyer et al. 2007), ELODIE at the 
Haute-Provence Observatory (Naef et al. 2004), and the Hobby-Eberly 
Telescope (HET) (McArthur et al. 2010). Recently, the 
combined data of Lick (Fischer et al. 2003; Wright et al. 2009) and 
ELODIE (Naef et al. 2004) has been reported to contain a fourth 
planetary signal (Curiel et al. 2011). 

We re-analyse the combined RV data of υ And by using the 
model inadequacy criterion. However, before we start, we check 
the consistency of the 248 Lick RV's published in Fischer et al. 
(2003) and the 284 Lick RV's published in Wright et al. (2009) 
(we denote these data sets as Lick1 and Lick2, respectively), 
because Curiel et al. (2011) used Lick2 data and the additional 30 
RV points from Lick1 that were not included in Lick2. The fact 
that these 30 measurements were not included in Lick2, likely 
because of suspected biases or calibration errors, suggests that there 
could be some biases within the combined Lick data analysed in 
Curiel et al. (2011) as well. 

The Lick1 and Lick2 data sets appear to have one striking 
difference. While they both imply that there are indeed 
four Keplerian signals in the υ And RV's, as concluded by 
Curiel et al. (2011), they do not agree on the orbital period of 
the proposed fourth signal. The probability of the 
three-companion model is significantly lower than that of the four-planet 
model: it is only $10^{-4}$ and $10^{-24}$ times that of the four-planet 
model for Lick1 and Lick2, respectively. This implies that there 
is either a fourth Keplerian 
signal in the data or biases that mimic Keplerian periodicity. The 
MAP estimate and the corresponding $\mathcal{D}_{0.99}$ set of the period of 
this fourth signal are 3120 [2560, 3940] days for Lick1 data and 
3860 [3180, 5160] days for Lick2 data. The latter of these estimates 
appears to be very close to the estimate of Curiel et al. (2011) 
of 3848.86 ± 0.74 days. However, because of the difference of 
more than 700 days between the MAP estimates of the periods 
from Lick1 and Lick2, we cannot conclude, based on the Lick 
data alone, that there are indeed four Keplerian signals in the 
data. This inconsistency is seen the most clearly when looking 
at the equiprobability contours of the parameter posterior 
densities given each data set. The contours containing 50%, 95%, and 
99% of the density are shown in Fig. 2 for the period and 
amplitude parameters of υ And d (top) and the proposed υ And e 
(bottom). The Lick1 contours are shown in red and Lick2 contours 
in blue. As seen in this figure, the estimated period and amplitude 
of υ And d also differ between the two Lick data sets. 

Because of the inconsistency of the Lick data sets published 
in Fischer et al. (2003) and Wright et al. (2009), we use the 
model inadequacy criterion to find out if either of these two data 
sets is also inconsistent with the combined ELODIE, AFOE, 
HET, and HJS data. We denote this combined data as $m$ and 
use $m_1$ and $m_2$ to denote the Lick1 and Lick2 data, respectively, 
and calculate the Bayes factors $B(m, m_1)$ and $B(m, m_2)$ for the 
model $\mathcal{M}_{I,4}$. The logarithms of these factors are 4.01 and -10.20, 
respectively (Table 6). This implies that the Lick2 data set is 
inconsistent with the rest of the data and the 4-companion model is 
an inadequate description with a probability of more than 0.999, 
whereas the Lick1 data cannot be shown inconsistent with the 
rest of the data with a probability exceeding 5%. Therefore, it 
appears that Lick1 data (Fischer et al. 2003) is consistent with 
the other four data sets but the Lick2 data (Wright et al. 2009) 
is not. 

We also investigated whether some of the ELODIE, AFOE, 
HET, HJS, and Lick data sets were inconsistent with the rest of 
the data by calculating the Bayes factors $B(m_i, m)$, where $m_i$, $i = 
1, ..., 5$, refers to each of these sets, respectively, and $m$ contains 
all the data except the set $m_i$. We performed these calculations 
using both Lick1 data and Lick2 data. The probabilities of the 
model $\mathcal{M}_{I,4}$ being inadequate in describing each of these sets 
with respect to the rest of the data are shown in Table 6. 

The results in Table [6] show that while Lick2 data is incon- 
sistent with the rest of the measurements with respect to the 
model M1.4, the AFOE data is also inconsistent with the rest 
of the measurements regardless of using the Lickl or Lick2 data 
Table 5. The four-planet solution of GJ 581 combined HARPS and HIRES data. The MAP estimates of the parameters and in
brackets the limits of their $D_{0.99}$ sets.

| Parameter | Planet e | Planet b | Planet c | Planet d |
|---|---|---|---|---|
| $P$ [days] | 3.1487 [3.1479, 3.1507] | 5.36845 [5.36810, 5.36890] | 12.917 [12.908, 12.926] | 66.88 [66.12, 67.32] |
| $e$ | 0.05 [0, 0.38] | 0.005 [0, 0.048] | 0.04 [0, 0.24] | 0.36 [0, 0.65] |
| $K$ [m s$^{-1}$] | [illegible in source] | [illegible in source] | [illegible in source] | [illegible in source] |
| $\omega$ [rad] | 2.4 [0, 2π] | 3.9 [0, 2π] | 2.6 [0, 2π] | 5.6 [0, 2π] |
| $M$ [rad] | 2.6 [0, 2π] | 2.6 [0, 2π] | 3.5 [0, 2π] | 4.7 [0, 2π] |
| $m_p \sin i$ [$M_\oplus$] | 1.86 [1.14, 2.51] | 15.73 [14.38, 16.95] | 5.51 [4.45, 6.56] | 5.19 [3.36, 7.21] |
| $a$ [AU] | 0.0284 [0.0275, 0.0294] | 0.0406 [0.0393, 0.0420] | 0.0728 [0.0706, 0.0751] | 0.218 [0.211, 0.226] |
| $\gamma_1$ [m s$^{-1}$] (HARPS) | -0.36 [-0.88, 0.12] | | | |
| $\gamma_2$ [m s$^{-1}$] (HIRES) | 0.38 [-0.41, 1.17] | | | |
| $\sigma_{I,1}$ [m s$^{-1}$] (HARPS) | 1.50 [1.00, 2.01] | | | |
| $\sigma_{I,2}$ [m s$^{-1}$] (HIRES) | 2.39 [1.77, 3.09] | | | |










[Figure 2: two contour panels; horizontal axes $P_3$ [day] (top) and $P_4$ [day] (bottom).]

Fig. 2. The equiprobability contours of the period and amplitude
parameters of υ And d (top) and υ And e (bottom) containing
50%, 95%, and 99% of the probability density. The red colour
denotes the contours given the Lick1 data of Fischer et al. (2003)
and blue is used to denote the contours given the Lick2 data of
Wright et al. (2009).



among the others in the analyses. We also note that the same
inconsistency remains for the AFOE data when using the
three-companion model $\mathcal{M}_{I,3}$ in the analyses. Therefore, as also noted
by Curiel et al. (2011), we conclude that the AFOE data has
additional biases and should not be used together with the rest of
the data because the results would be prone to biases as well.
To further demonstrate the inconsistency of the AFOE data and
the other data sets, we show the RV residuals of the AFOE data
when the three-companion model has been used to analyse the



Table 6. The log-Bayes factors (log B) and probabilities (P) of
model $\mathcal{M}_{I,4}$ being an inadequate description of each individual
set of RV's of υ And and the rest of the data. The Lick1 (L1) and
Lick2 (L2) data are analysed separately.

| Set | log B (L1) | log B (L2) | P (L1) | P (L2) |
|---|---|---|---|---|
| Lick | 4.01 | -10.20 | 0.018 | > 0.999 |
| AFOE | -12.91 | -10.81 | > 0.999 | > 0.999 |
| HET | 52.73 | 38.89 | < $10^{-22}$ | < $10^{-16}$ |
| HJS | 10.55 | 8.70 | < $10^{-4}$ | < $10^{-3}$ |
| ELODIE | 13.59 | 19.36 | < $10^{-5}$ | < $10^{-8}$ |



[Figure 3: RV residuals; horizontal axis $t$ [JD-2450000].]

Fig. 3. The residuals of AFOE RV's of υ And with the
planetary signals subtracted.



combined data of AFOE, Lick1, ELODIE, HET, and HJS (Fig. 3).
These residuals appear to show a low-amplitude periodicity
that roughly corresponds to the period of companion d, despite
the fact that the signal of this companion (and those of b and c)
has been subtracted.

We continue the analyses of υ And RV's by neglecting the
AFOE data and by using the older Lick1 data set (Fischer et al.
2003), because of the inconsistencies of the AFOE and Lick2
data with the rest of the data sets. The combined data set
consists of Lick1, HET, ELODIE, and HJS data that contain 248, 79,
71, and 41 measurements, respectively. This combined data set
with 439 measurements was analysed using two models, namely,
$\mathcal{M}_{I,3}$ and $\mathcal{M}_{I,4}$, because there are clearly three strong Keplerian
signals in the data as demonstrated already by Butler et al.
(1999), and because the noise levels of the different data sets
likely differ from one another based on the previous analyses.
Table 7. The log-Bayes factors (log B) and probabilities (P) of
model $\mathcal{M}_{I,4}$ being an inadequate description of υ And RV's for
each individual data set and the rest of the data with the restricted
data set of Lick1, HET, HJS, and ELODIE.

| Set | log B | P |
|---|---|---|
| Lick1 | 23.69 | < $10^{-10}$ |
| HET | 32.78 | < $10^{-13}$ |
| HJS | 16.34 | < $10^{-7}$ |
| ELODIE | 18.06 | < $10^{-7}$ |




[Figure 4: RV time series; horizontal axis $t$ [JD-2450000].]

Fig. 4. The Lick1, ELODIE, HJS, and HET RV's with the
signals of the three inner companions removed. The solid curve
represents the Keplerian corresponding to planet candidate υ And
e.



Since we removed the AFOE data from the analyses, we
need to assess whether the model can be shown to be an inadequate
description of the resulting restricted data set. For this purpose,
we re-calculate the values in Table 6 and show them in Table 7.
According to these results, none of the four data sets can be said
to conflict with the others. Also, using the Bayesian model
inadequacy for multiple data sets by calculating $B(m_1, ..., m_4)$,
where $m_i$, $i = 1, ..., 4$, correspond to Lick1, ELODIE, HJS, and
HET data sets, respectively, we receive a value of $1.4 \times 10^{9}$,
which means that these sets are inconsistent with a probability of
less than $10^{-9}$ given the four-companion model. Therefore, these
four sets can be combined reliably and we calculate our final
solution of υ And RV's using these four sets.

The posterior probability of the model $\mathcal{M}_{I,3}$ is less than $10^{-8}$
times that of model $\mathcal{M}_{I,4}$. This implies that there are
indeed four periodic signals in the combined data set. The revised
orbital parameters with respect to the four-companion model
are shown in Table 8. The RV variations corresponding to the
longest periodicity in the data are shown in Fig. 4 together with
the fitted Keplerian signal. The signals of the three inner
companions have been subtracted from the residuals in Fig. 4.

When comparing the orbital parameters of our solution in
Table 8 with the solution of Curiel et al. (2011), it can be seen
that the period of υ And e is significantly lower in our
solution. We received a MAP estimate for the orbital period of
2860 days ($D_{0.99}$ = [2600, 3220]), whereas Curiel et al. (2011)
reported a period of 3848.86±0.74 days. This difference can
arise from the fact that they used the more recent Lick2 data
set of Wright et al. (2009), which is not consistent with the
other RV's according to our analyses. We also found another
solution for the period of υ And e. This period is 5750 days
($D_{0.99}$ = [5220, 6610]), roughly twice the MAP periodicity, but
its posterior probability is more than a thousand times lower than
that of the solution in Table 8.

We note that while Curiel et al. (2011) adopted a jitter of
10 m s$^{-1}$ when analysing the RV's of υ And, the estimate of
Butler et al. (2006) is only 4.2 m s$^{-1}$. Our results are consistent
with the latter estimate because the upper limit of excess noise,
including the stellar jitter, is 4.58 m s$^{-1}$ based on the lowest noise
level in the data sets, that of the HET data (Table 8). According to
our results, the jitter has likely an even lower value of roughly
2.0 m s$^{-1}$. This also implies that the Lick1, ELODIE, and HJS
data contain an additional source of RV variations - likely the
telescope-instrument combination used to measure these data.

4. Discussion 

We have proposed a simple method for assessing whether a sta- 
tistical model is an inadequate description of multiple indepen- 
dent data sets. This method is simply an application of the well 
known Bayesian model selection theory and the law of condi- 
tional probability but it also differs from the common model se- 
lection approach because it provides the means of determining 
whether a single model, i.e. the best model in the a priori se- 
lected model set, is not an adequate description of the data sets 
and needs to be improved. 

Using this Bayesian model inadequacy criterion and common
model comparisons, we re-analysed three combined RV
data sets made using at least two telescope-instrument
combinations. According to our results, the Gliese 581 RV's
observed using the HIRES and HARPS spectrographs can be
described reliably using the model $\mathcal{M}_{I,4}$, where their uncertainties
caused by stellar jitter and additional instrument uncertainty
have been modelled to have different magnitudes - at least, the
four-companion model cannot be shown to be an inadequate
description of these two data sets. This suggests that the results in
Tuomi (2011) are indeed reliable in this respect.

The RV's of HD 217107 showed that there can be
significant telescope-instrument-induced uncertainties in the
data. Therefore, we were forced to describe these uncertainties
with different parameters for each telescope-instrument
combination. According to our results, the telescope-instrument
uncertainties can differ considerably between different data sets,
which makes it more difficult to put reliable constraints on the
stellar jitter. While the jitter of HD 217107 is not likely to
exceed 6.0 m s$^{-1}$ based on the noise in the Euler data, the Lick1
data turned out to have excess noise of 5-10 m s$^{-1}$ with respect
to this jitter estimate (Table 3). Therefore, we conclude that the
instrument uncertainties cannot be assumed as known, and
additional noise should always be assumed to exist in the data. When
neglecting this additional uncertainty, the estimates of orbital
parameters can be biased and their uncertainty estimates will
certainly be unrealistically low with respect to the information in
the measurements.

The RV's of υ Andromedae proved a challenging analysis
problem on their own. These data consisted of five independent
RV data sets. According to our results, the Lick2 data of
Wright et al. (2009) was not consistent with the other data sets
with respect to our model inadequacy criterion, and the earlier
Lick1 data set of Fischer et al. (2003) should be used instead.
Unfortunately, it is not possible to tell where this inadequacy
arises from. Also, the AFOE data (Butler et al. 1999) turned out
to contradict the rest of the data, likely because of biases
in the process of making the measurements, as also noted by
Curiel et al. (2011). This left only four consistent data sets,




Table 8. The four-planet solution of υ Andromedae RV's from Lick1, HET, ELODIE, and HJS. MAP estimates of the parameters
and the limits of their $D_{0.99}$ sets.

| Parameter | Planet b | Planet c | Planet d | Planet e |
|---|---|---|---|---|
| $P$ [days] | 4.617098 [4.617047, 4.617174] | 241.50 [241.31, 241.70] | 1278.4 [1271.2, 1285.6] | 2860 [2600, 3220] |
| $e$ | 0.022 [0, 0.047] | 0.278 [0.250, 0.311] | 0.307 [0.272, 0.339] | 0.13 [0, 0.28] |
| $K$ [m s$^{-1}$] | 71.0 [69.0, 72.7] | 52.8 [51.0, 55.2] | 61.6 [59.1, 64.3] | 7.1 [4.9, 9.4] |
| $\omega$ [rad] | 1.4 [0.0, 3.0] | 4.15 [3.99, 4.30] | 4.46 [4.32, 4.62] | 2.6 [0.2, 5.1] |
| $M$ [rad] | 2.8 [1.4, 4.4] | 3.97 [3.82, 4.11] | 0.29 [0.17, 0.41] | 2.4 [0.0, 5.1] |
| $m_p \sin i$ [$M_J$] | 0.683 [0.617, 0.748] | 1.91 [1.70, 2.09] | 3.85 [3.47, 4.28] | 0.58 [0.40, 0.78] |
| $a$ [AU] | 0.0589 [0.0560, 0.0615] | 0.823 [0.783, 0.860] | 2.50 [2.38, 2.62] | 4.27 [3.95, 4.66] |
| $\gamma_1$ [m s$^{-1}$] (Lick) | 3.7 [1.3, 6.0] | | | |
| $\gamma_2$ [m s$^{-1}$] (ELODIE) | -12.7 [-15.9, -7.9] | | | |
| $\gamma_3$ [m s$^{-1}$] (HJS) | -15.4 [-19.5, -10.8] | | | |
| $\gamma_4$ [m s$^{-1}$] (HET) | -19.4 [-22.2, -16.8] | | | |
| $\sigma_{I,1}$ [m s$^{-1}$] (Lick) | 7.68 [5.89, 9.47] | | | |
| $\sigma_{I,2}$ [m s$^{-1}$] (ELODIE) | 16.3 [12.2, 20.4] | | | |
| $\sigma_{I,3}$ [m s$^{-1}$] (HJS) | 8.6 [1.5, 18.8] | | | |
| $\sigma_{I,4}$ [m s$^{-1}$] (HET) | 1.95 [0, 4.58] | | | |









Lick1 (Fischer et al. 2003), ELODIE (Naef et al. 2004), HET
(McArthur et al. 2010), and HJS (Wittenmyer et al. 2007), to
be used in the analyses. With differing noise levels for each of
these sets, we calculated the revised orbital parameters for the υ
And planetary system with four planetary companions (Table 8).

Because our four-planet solution of the RV's of υ And differs
significantly from the proposed solution of Curiel et al. (2011)
with respect to the orbital period of the outer planet, numerical
integrations of the orbits are needed to assess the stability of our
solution. The lower estimate for the orbital period of υ And e
does not support the conclusion that the d and e planets could
be in a 3:1 mean motion resonance (MMR). However, our
solution coincides roughly with a 2:1 MMR, which could enable
the stability of the system over long time-scales. Investigating
the stability of our solution is necessary to be able to determine
whether it corresponds to a physically viable system and is not
simply an artefact caused by noise, data sampling, and possible
biases in the measurements.
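
The period ratios behind the resonance statement can be checked with simple arithmetic. The sketch below uses the MAP period of planet d and the MAP estimate and $D_{0.99}$ limits of planet e from Table 8, together with the period reported by Curiel et al. (2011). Dividing the latter by our own period of planet d, and holding the period of d fixed at its MAP value, are simplifying assumptions made only for this illustration.

```python
# Periods in days, taken from Table 8 and from the text.
p_d = 1278.4                                    # MAP period of planet d
p_e_map, p_e_lo, p_e_hi = 2860.0, 2600.0, 3220.0  # MAP and D_0.99 limits of planet e
p_e_curiel = 3848.86                            # period of planet e in Curiel et al. (2011)

ratio_map = p_e_map / p_d        # roughly 2.24
ratio_lo = p_e_lo / p_d          # lower end of the interval
ratio_hi = p_e_hi / p_d          # upper end of the interval
ratio_curiel = p_e_curiel / p_d  # close to 3

print(f"P_e / P_d (this work): {ratio_map:.2f} [{ratio_lo:.2f}, {ratio_hi:.2f}]")
print(f"P_e / P_d (Curiel et al. period): {ratio_curiel:.2f}")
```

Under these assumptions the whole interval of the ratio lies between 2 and 3, nearer to the 2:1 commensurability at its lower end, whereas the longer period lies close to 3:1.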

For successful detections, it is crucial that the noise - i.e. all
the other variations except the Keplerian signals - in the valuable
measurements is modelled as realistically as possible and not
simply minimised as is commonly the case when using simple
$\chi^2$ minimisations and related methods. Our method can be used
readily to detect whether the statistical model indeed describes
the data adequately with respect to the selected noise model as
well.
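
The difference between modelling the noise and merely minimising $\chi^2$ can be illustrated with a small sketch. It assumes Gaussian white noise in which an excess-noise term is added in quadrature to the reported uncertainties; this specific form, the function names, and the residual values are assumptions made for the illustration and are not taken from the analyses above.

```python
import math


def chi_square(residuals, sigmas, excess):
    """Chi-square with an excess-noise term added in quadrature."""
    return sum(r * r / (s * s + excess * excess)
               for r, s in zip(residuals, sigmas))


def log_likelihood(residuals, sigmas, excess):
    """Gaussian log-likelihood; includes the log-variance normalisation."""
    total = 0.0
    for r, s in zip(residuals, sigmas):
        var = s * s + excess * excess
        total += -0.5 * (r * r / var + math.log(2.0 * math.pi * var))
    return total


# Hypothetical residuals [m/s] that scatter more than the quoted 1 m/s errors.
residuals = [3.0, -4.0, 5.0, -2.0, 1.0, -6.0]
sigmas = [1.0] * len(residuals)

# Grid search over the excess-noise parameter from 0 to 20 m/s.
grid = [0.01 * k for k in range(0, 2001)]
best_excess = max(grid, key=lambda e: log_likelihood(residuals, sigmas, e))

print(f"excess noise maximising the likelihood: {best_excess:.2f} m/s")
print(chi_square(residuals, sigmas, 0.0), chi_square(residuals, sigmas, 20.0))
```

The $\chi^2$ value alone keeps decreasing as the excess noise grows, so it cannot constrain that parameter, whereas the full likelihood has an interior maximum because of its normalisation term.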

The criterion applies directly to measurements of any
complex system. Systems whose behaviour, time-evolution,
and dependence on different physical and other factors
cannot be derived from fundamental physical principles are
difficult to model because the models are necessarily empirical
descriptions, whose validity can only be assessed using
measurements. In such systems, there can be numerous small-scale
effects and/or biases, whose existence is not known and whose
magnitude cannot be measured. These effects cannot therefore
be taken into account in the model constructed to describe some
desired features of the system. As briefly noted in Kaasalainen
(2011), the ability to show that a model is an insufficient
description of the measurements is therefore needed to be able to
determine whether the model needs to be improved further to extract
all the valuable information from the noisy data. According to
the demonstrations in this article, our method can be said to
satisfy these needs to a significant extent. Also, as we did not make
any assumptions regarding the exact nature of the model, the
criterion can be applied to any problem for which it is possible to
calculate the likelihoods of the measurements using the model.

Finally, we note that if the model has been constructed prior
to the measurements, the model inadequacy means that the
earlier data sets used to construct the model, i.e. to select the model
formulae and calculate the posterior densities of the model
parameters, conflict with the new ones with respect to the model.
It could also be that the model is being developed using a single
data set in hand. Then, despite being the best model in the sense
of having the greatest posterior probability, the model could still
be inadequate in describing some part of the data set with respect
to another part (Kaasalainen 2011). Either way, the
measurements cannot be described adequately using the selected model
and we say that the model is inadequate. Our criterion can be
used in these cases as well.
Appendix A: Model inadequacy criterion

A.1. Two data sets

We start by defining what we mean by model inadequacy in de- 
scribing two sets of data and derive its equations from the com- 
mon Bayesian model comparison theory. 

We assume that there are independent measurements, or
series of measurements, $m_i$, $i = 1, ..., N$, $N \geq 2$, that have been
made to study the same system of interest. Because these
measurements describe the same system, or at least contain
information on the same aspects of the system of interest, they can be
modelled with statistical models that have at least one parameter
in common, namely, $\theta \in \Theta$. Throughout this article the
parameter space $\Theta$ is a bounded subset of $\mathbb{R}^k$. The parameter $\theta$ is used to
quantify some features in the measurements $m_i$, $\forall i$. In addition,
there are other parameters, namely $\omega_i$, $i = 1, ..., N$, that
each quantify some additional features in the $i$th measurement.

The measurements can now be used to compare different
statistical models using the Bayesian model selection theory. Let
$P(\mathcal{M} \mid m_i, m_j)$ be the posterior probability of model $\mathcal{M}$ given the
measurements $m_i$ and $m_j$. The model can be any model for
which a likelihood function exists. With this model, both
measurements are modelled using the same parameter $\theta$ and
different parameters $\omega_i$ and $\omega_j$, respectively. Probability $P(\mathcal{B} \mid m_i, m_j)$
is the corresponding probability when measurements $m_i$ and $m_j$
are modelled using the same model structure as model $\mathcal{M}$ has, but
this time with parameters $\phi_i$ and $\phi_j$, where $\phi_k = (\theta_k, \omega_k)$, $k = i, j$,
respectively.

Therefore, because of the independence of the measurements
and the independence of $\phi_i$ and $\phi_j$, the marginal integral in Eq.
(1) of the measurements with respect to the model $\mathcal{B}$ can be
written as²

$$P(m_i, m_j \mid \mathcal{B}) = \int l(m_i, m_j \mid \phi_i, \phi_j, \mathcal{B}) \, \pi(\phi_i, \phi_j \mid \mathcal{B}) \, d(\phi_i, \phi_j) = \prod_{k=i,j} \int l(m_k \mid \phi_k, \mathcal{B}) \, \pi(\phi_k \mid \mathcal{B}) \, d\phi_k = P(m_i \mid \mathcal{M}) P(m_j \mid \mathcal{M}), \tag{A.1}$$

where the model has been changed to $\mathcal{M}$, because given only one
measurement, $\mathcal{M}$ and $\mathcal{B}$ are in fact the same model. In the above,
we have used $l$ and $\pi$ to denote the likelihood function and the
prior density, respectively.

Now, let $s \in [0, 1]$ be a small threshold probability. We
compare the probabilities of the models $\mathcal{M}$ and $\mathcal{B}$ given the
measurements $m_i$ and $m_j$. If $P(\mathcal{M} \mid m_i, m_j) < s$, we say that the model is an
inadequate description of the data with a probability of $1 - s$ and
that the model $\mathcal{M}$ cannot be used to model them both. In other
words, the probability of model $\mathcal{M}$ is so small that the
measurements should instead be modelled using different parameters $\theta_i$
and $\theta_j$, i.e. using model $\mathcal{B}$. This condition is simply the common
Bayesian model selection criterion (e.g. Jeffreys 1961). From
this condition and Eq. (A.1), and when selecting the prior
probabilities of the two models equal, the comparison of models $\mathcal{M}$
and $\mathcal{B}$ according to Eq. (1) leads to

$$P(m_i, m_j \mid \mathcal{M}) < \frac{s}{1-s} P(m_i \mid \mathcal{M}) P(m_j \mid \mathcal{M}). \tag{A.2}$$



We denote $r = s(1 - s)^{-1}$ and leave the model out of the notation
by denoting $P(m) = P(m \mid \mathcal{M})$ when it is clear which model has
been used. Now, we define the model inadequacy as follows.

The model used to describe the measurements $m_i$ and $m_j$ is
not adequate with level $r$ if

$$B(m_i, m_j) := \frac{P(m_i, m_j)}{P(m_i) P(m_j)} < r, \tag{A.3}$$



where the factor $B$ is actually the Bayes factor in favour of model
$\mathcal{M}$ and against model $\mathcal{B}$ and $r$ is some (small) positive number
corresponding to the selected threshold probability $s$.
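
The link between the threshold probability $s$, the level $r$, and the Bayes factor can be written out explicitly. The following is a short sketch of the intermediate step, assuming equal prior probabilities for the two models as stated above and using Eq. (A.1) for the marginal likelihood of the separate-parameter model.

```latex
\begin{align}
P(\mathcal{M} \mid m_i, m_j)
  &= \frac{P(m_i, m_j \mid \mathcal{M})}
          {P(m_i, m_j \mid \mathcal{M}) + P(m_i \mid \mathcal{M}) P(m_j \mid \mathcal{M})}
   = \frac{B(m_i, m_j)}{1 + B(m_i, m_j)}, \\
P(\mathcal{M} \mid m_i, m_j) < s
  &\iff B(m_i, m_j) < \frac{s}{1 - s} = r .
\end{align}
```

Under these assumptions the probability of inadequacy is $1/(1 + B)$; for example, $s = 0.001$ corresponds to $r \approx 0.001001$.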

Because we have made no assumption on the exact nature
of the measurements, the model, or the modelled system, the
above condition applies to anything that can be measured and
described with a statistical model. In fact, to be able to use
Eq. (A.3), a sufficient condition is that the measurements $m_i$ and
$m_j$ are modelled using statistical models that have at least one
parameter, namely $\theta$, in common. The model of the $i$th data set
may have other parameters $\omega_i$ and these have to be treated as
free parameters as well, but they have no role in Eq. (A.3)
because they are independent of the other data set.

Eq. (A.3) in fact states that the measurements are not
distributed according to the model used. However, the converse
is not true. If the condition in Eq. (A.3) does not hold for some
measurements $m_i$ and $m_j$, it cannot be said that they are drawn
from the same modelled density, even though it might be a
reasonable assumption in practice.
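
As a toy illustration of the criterion in Eq. (A.3), consider a model that is not used in this article: measurements with known Gaussian noise around a common mean $\theta$, and a Gaussian prior on $\theta$. The marginal likelihoods are then available in closed form, so the Bayes factor can be evaluated exactly. The model, the data values, and the function names are all assumptions made for this sketch.

```python
import math


def log_evidence(x, sigma, mu0, tau):
    """Log marginal likelihood of x_k ~ N(theta, sigma^2), theta ~ N(mu0, tau^2).

    Marginally, x is multivariate normal with mean mu0 and covariance
    sigma^2 * I + tau^2 * (matrix of ones).
    """
    n = len(x)
    d = [xi - mu0 for xi in x]
    s1 = sum(d)
    s2 = sum(di * di for di in d)
    log_det = 2.0 * n * math.log(sigma) + math.log(1.0 + n * tau**2 / sigma**2)
    quad = (s2 - tau**2 * s1**2 / (sigma**2 + n * tau**2)) / sigma**2
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


def log_bayes_factor(x_i, x_j, sigma, mu0, tau):
    """log B(m_i, m_j) = log P(m_i, m_j) - log P(m_i) - log P(m_j)."""
    return (log_evidence(x_i + x_j, sigma, mu0, tau)
            - log_evidence(x_i, sigma, mu0, tau)
            - log_evidence(x_j, sigma, mu0, tau))


set_a = [0.1, -0.2, 0.05, 0.0]
set_b = [0.0, 0.15, -0.1, 0.05]
set_b_offset = [v + 5.0 for v in set_b]   # same scatter, shifted by 5 sigma

s = 0.001
log_r = math.log(s / (1.0 - s))           # level r for threshold probability s

log_b_consistent = log_bayes_factor(set_a, set_b, 1.0, 0.0, 10.0)
log_b_offset = log_bayes_factor(set_a, set_b_offset, 1.0, 0.0, 10.0)
print(log_b_consistent > log_r, log_b_offset < log_r)
```

In this toy case the two mutually consistent sets give a positive log-Bayes factor, while the shifted set drives it well below $\log r$, so the shared-mean model would be flagged as inadequate.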

The Bayes factor in Eq. (A.3) has an interesting property
when interpreted in terms of the information gain defined
using the Kullback-Leibler (K-L) divergence (Kullback & Leibler
1951) between prior and the posterior. The K-L divergence is
defined for two continuous random variables with probability
densities $u(x)$ and $v(x)$ as

$$D_{KL}\{u(x) \,\|\, v(x)\} = \int u(x) \log \frac{u(x)}{v(x)} \, dx. \tag{A.4}$$



2 The reader should refer to any basic text on conditional probabili- 
ties and independence. 



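With this notation, we can write the K-L divergence of moving
from the prior to the posterior (given both data sets). Hence, it
follows that

$$\begin{aligned}
D_{KL}\{\pi(\theta \mid m_i, m_j) \,\|\, \pi(\theta)\} &= \int \pi(\theta \mid m_i, m_j) \log \frac{\pi(\theta \mid m_i, m_j)}{\pi(\theta)} \, d\theta \\
&= -\log P(m_i, m_j) + \int \pi(\theta \mid m_i, m_j) \log l(m_i, m_j \mid \theta) \, d\theta \\
&= -\log P(m_i, m_j) + \sum_{k=i,j} \int \pi(\theta \mid m_i, m_j) \log \frac{l(m_i, m_j \mid \theta)}{l(m_k \mid \theta)} \, d\theta \\
\Leftrightarrow \log B(m_i, m_j) &= D_{KL}\{\pi(\theta \mid m_i, m_j) \,\|\, \pi(\theta)\} \\
&\quad - D_{KL}\{\pi(\theta \mid m_i, m_j) \,\|\, \pi(\theta \mid m_i)\} - D_{KL}\{\pi(\theta \mid m_i, m_j) \,\|\, \pi(\theta \mid m_j)\},
\end{aligned} \tag{A.5}$$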

where we have used the Bayes rule and the facts that the integral
over a probability density equals unity and $m_i$ and $m_j$ are
independent.

This means that the logarithm of the Bayes factor used to
determine the model inadequacy in describing measurements $m_i$
and $m_j$ can in fact be interpreted as the total information gain of
the two measurements minus the information gains of moving
from the posterior with respect to each measurement alone to
the full posterior.

Alternatively, the Bayes factor can be written using the
information losses, or K-L divergences, of moving from the
posteriors back to the prior (as opposed to the information gain of
moving from prior to the posterior). With this terminology, and using
a similar derivation as for the information gain in Eq. (A.5), the
expression in Eq. (A.5) can be replaced by

$$\log B(m_i, m_j) = D_{KL}\{\pi(\theta) \,\|\, \pi(\theta \mid m_i, m_j)\} - D_{KL}\{\pi(\theta) \,\|\, \pi(\theta \mid m_i)\} - D_{KL}\{\pi(\theta) \,\|\, \pi(\theta \mid m_j)\}, \tag{A.6}$$


which means that the logarithm of the Bayes factor can be inter- 
preted as the total information loss of the two measurements mi- 
nus the information losses of the two measurements separately. 
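
Both K-L expressions for the two-set Bayes factor can be checked numerically on a toy problem. The sketch below replaces the continuous parameter by three discrete values, so that integrals become sums; the prior weights and likelihood values are arbitrary illustrative numbers, and the helper names are hypothetical.

```python
import math


def kl(p, q):
    """Discrete Kullback-Leibler divergence D_KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def posterior(prior, *likelihoods):
    """Posterior over the grid and the marginal likelihood (evidence),
    assuming independent data sets so that likelihoods multiply."""
    weights = list(prior)
    for lik in likelihoods:
        weights = [w * l for w, l in zip(weights, lik)]
    evidence = sum(weights)
    return [w / evidence for w in weights], evidence


prior = [0.2, 0.5, 0.3]   # prior over three parameter values
lik_i = [0.9, 0.4, 0.1]   # l(m_i | theta) on the grid
lik_j = [0.7, 0.5, 0.2]   # l(m_j | theta) on the grid

post_i, p_i = posterior(prior, lik_i)
post_j, p_j = posterior(prior, lik_j)
post_ij, p_ij = posterior(prior, lik_i, lik_j)

log_b = math.log(p_ij / (p_i * p_j))
# Information-gain form: divergences from the full posterior.
gain_form = kl(post_ij, prior) - kl(post_ij, post_i) - kl(post_ij, post_j)
# Information-loss form: divergences from the prior.
loss_form = kl(prior, post_ij) - kl(prior, post_i) - kl(prior, post_j)

print(log_b, gain_form, loss_form)
```

The three printed numbers should agree to rounding error, which is a quick sanity check of the gain and loss identities in this discrete setting.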

A.2. Multiple data sets 

When there are more than two data sets available, the model
inadequacy criterion can be derived easily following the
considerations in the previous subsection. For measurements $m_i$,
$i = 1, ..., N$, it can be seen that

$$P(m_1, ..., m_N \mid \mathcal{B}) = \prod_{i=1}^{N} P(m_i \mid \mathcal{M}). \tag{A.7}$$



It then follows that the model inadequacy criterion corresponding
to that in Eq. (A.2) can be written as

$$P(m_1, ..., m_N \mid \mathcal{M}) < \frac{s}{1-s} \prod_{i=1}^{N} P(m_i \mid \mathcal{M}). \tag{A.8}$$

We again use $B$ to denote the Bayes factor and write this criterion
in the following way.

The model used to describe measurements $m_1, ..., m_N$ does
not describe the measurements adequately with level
$r$ if

$$B(m_1, ..., m_N) := \frac{P(m_1, ..., m_N)}{\prod_{i=1}^{N} P(m_i)} < r. \tag{A.9}$$

From Eq. (A.9), it can be seen that for $N$ data sets, the
marginal integral needs to be determined $N + 1$ times to receive
the Bayes factor that is used to assess the model inadequacy. This
requirement cannot be considered very limiting, because in
practice, the data sets are commonly analysed separately anyway.

In terms of K-L information loss of moving from the
posterior to the prior, the Bayes factor $B$ can again be interpreted in
a simple manner using similar derivation as in Eq. (A.5). As a
consequence, it follows that

$$\log B(m_1, ..., m_N) = D_{KL}\{\pi(\theta) \,\|\, \pi(\theta \mid m_1, ..., m_N)\} - \sum_{i=1}^{N} D_{KL}\{\pi(\theta) \,\|\, \pi(\theta \mid m_i)\}. \tag{A.10}$$

However, the information gains cannot be used in a similar
manner as in Eq. (A.5). Instead, using the information gain of the
measurements the generalisation of Eq. (A.5) to several
measurements is

$$\log \frac{\prod_{i=1}^{N} B(m_i, (m_1, ..., m_k, ..., m_N)|_{k \neq i})}{B(m_1, ..., m_N)} = D_{KL}\{\pi(\theta \mid m_1, ..., m_N) \,\|\, \pi(\theta)\} - \sum_{i=1}^{N} D_{KL}\{\pi(\theta \mid m_1, ..., m_N) \,\|\, \pi(\theta \mid m_1, ..., m_k, ..., m_N)|_{k \neq i}\}, \tag{A.11}$$

where $B(m_i, (m_1, ..., m_k, ..., m_N)|_{k \neq i})$ is the Bayes factor
describing the model inadequacy with respect to two data sets, namely,
$m_i$ and the combined data set $(m_1, ..., m_k, ..., m_N)|_{k \neq i}$, which
denotes all the data except the measurement $m_i$.

Therefore, the Bayes factor determining the model
inadequacy in Eq. (A.9) can be interpreted as a measure of
information loss that results from disregarding the measurements to gain
information on the posterior minus the corresponding
information losses of disregarding each measurement one at a time.
Naturally, the gain and loss Eqs. (A.11) and (A.10) are
equivalent if $N = 2$, as was seen in the previous subsection.

Assuming that $B(m_1, ..., m_N) > 1$, which means that model $\mathcal{M}$
has a greater probability than $\mathcal{B}$, has an interesting consequence.
From this assumption, it follows that

$$D_{KL}\{\pi(\theta) \,\|\, \pi(\theta \mid m_1, ..., m_N)\} > \sum_{i=1}^{N} D_{KL}\{\pi(\theta) \,\|\, \pi(\theta \mid m_i)\}. \tag{A.12}$$

When again interpreted in terms of information loss, this means
that given a model that cannot be shown inadequate with $r = 1$,
the amount of information in the combined data set is greater
than the information in the individual data sets.
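
The multi-set Bayes factor and its K-L interpretations can be checked numerically for $N = 3$ in the same spirit as for two sets. This is a sketch on a discrete parameter grid with arbitrary illustrative numbers; the helper names are hypothetical. It evaluates the $N + 1$ marginal likelihoods needed for the Bayes factor itself, plus the $N$ leave-one-out evidences that enter the information-gain form.

```python
import math


def kl(p, q):
    """Discrete Kullback-Leibler divergence D_KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def posterior(prior, likelihoods):
    """Posterior over the grid and evidence for independent data sets."""
    weights = list(prior)
    for lik in likelihoods:
        weights = [w * l for w, l in zip(weights, lik)]
    evidence = sum(weights)
    return [w / evidence for w in weights], evidence


prior = [0.1, 0.4, 0.3, 0.2]
liks = [[0.8, 0.5, 0.2, 0.1],     # l(m_1 | theta)
        [0.6, 0.6, 0.3, 0.2],     # l(m_2 | theta)
        [0.9, 0.3, 0.3, 0.1]]     # l(m_3 | theta)
n = len(liks)

post_all, p_all = posterior(prior, liks)                    # 1 evidence
singles = [posterior(prior, [lik]) for lik in liks]         # N evidences
rests = [posterior(prior, liks[:i] + liks[i + 1:]) for i in range(n)]

# Multi-set Bayes factor: joint evidence over product of single evidences.
log_b_all = math.log(p_all) - sum(math.log(p) for _, p in singles)

# Information-loss form.
loss_form = kl(prior, post_all) - sum(kl(prior, post) for post, _ in singles)

# Information-gain form with leave-one-out Bayes factors.
log_b_pairs = [math.log(p_all) - math.log(singles[i][1]) - math.log(rests[i][1])
               for i in range(n)]
gain_lhs = sum(log_b_pairs) - log_b_all
gain_rhs = kl(post_all, prior) - sum(kl(post_all, post) for post, _ in rests)

print(log_b_all, loss_form)
print(gain_lhs, gain_rhs)
```

Each printed pair should agree to rounding error on this toy grid.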



